Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Machine learning segmentation #62

Open
wants to merge 426 commits into
base: v0.1.7
Choose a base branch
from

Conversation

leonardosepulveda
Copy link
Collaborator

I changed the name of nuclei_segmentation_cv2 to "machine_learning_segmentation". I will keep the cv2 segmentation capability and add a new class to allow the implementation of machine learning approaches. I will implement ilastik, U-nets and cellpose to begin with. The user will need to provide a trained model that can be used to segment either nuclei or cells. I will add a test function to compare the results of all available methods in a few FOVs.

@pep8speaks
Copy link

pep8speaks commented Apr 23, 2020

Hello @leonardosepulveda! Thanks for updating this PR.

Line 293:72: W291 trailing whitespace

Line 139:9: E128 continuation line under-indented for visual indent
Line 184:22: E261 at least two spaces before inline comment
Line 184:22: E262 inline comment should start with '# '
Line 185:24: E261 at least two spaces before inline comment
Line 185:24: E262 inline comment should start with '# '
Line 186:23: E261 at least two spaces before inline comment
Line 186:23: E262 inline comment should start with '# '
Line 285:59: W291 trailing whitespace
Line 295:77: W291 trailing whitespace
Line 297:71: W291 trailing whitespace
Line 363:49: E221 multiple spaces before operator
Line 363:52: E222 multiple spaces after operator
Line 363:60: E222 multiple spaces after operator
Line 363:81: E501 line too long (93 > 80 characters)
Line 402:58: W291 trailing whitespace
Line 513:67: W291 trailing whitespace
Line 560:61: E251 unexpected spaces around keyword / parameter equals
Line 561:49: E127 continuation line over-indented for visual indent
Line 562:65: E251 unexpected spaces around keyword / parameter equals
Line 563:49: E127 continuation line over-indented for visual indent

Line 143:46: W292 no newline at end of file

Line 62:81: E501 line too long (83 > 80 characters)
Line 64:81: E501 line too long (91 > 80 characters)
Line 69:81: E501 line too long (86 > 80 characters)

Line 12:81: E501 line too long (83 > 80 characters)

Comment last updated at 2021-01-08 21:09:37 UTC

@merlin-pep8speaks
Copy link
Collaborator

Hello @leonardosepulveda! Thanks for updating this PR.

Line 141:67: W291 trailing whitespace
Line 142:72: W291 trailing whitespace
Line 143:72: W291 trailing whitespace
Line 153:1: W293 blank line contains whitespace

@codecov
Copy link

codecov bot commented Apr 23, 2020

Codecov Report

Merging #62 (0ac2e2e) into v0.1.7 (8e2cb01) will increase coverage by 0.16%.
The diff coverage is 83.38%.

Impacted file tree graph

@@            Coverage Diff             @@
##           v0.1.7      #62      +/-   ##
==========================================
+ Coverage   87.33%   87.49%   +0.16%     
==========================================
  Files          57       59       +2     
  Lines        5108     5343     +235     
==========================================
+ Hits         4461     4675     +214     
- Misses        647      668      +21     
Impacted Files Coverage Δ
merlin/merlin.py 74.28% <ø> (ø)
merlin/util/dataportal.py 87.70% <ø> (ø)
merlin/util/segmentation.py 78.97% <78.97%> (ø)
merlin/analysis/segment.py 90.95% <89.87%> (+1.15%) ⬆️
merlin/core/dataset.py 90.88% <100.00%> (+0.02%) ⬆️
test/conftest.py 98.46% <100.00%> (+0.04%) ⬆️
test/test_merfish_segmentation_cellpose.py 100.00% <100.00%> (ø)
test/test_merfish_segmentation_cv2.py 100.00% <100.00%> (ø)
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8e2cb01...8d112f6. Read the comment docs.

@leonardosepulveda leonardosepulveda self-assigned this Jul 1, 2020
@leonardosepulveda
Copy link
Collaborator Author

I would like to merge this branch to master. I did tests and the new machine learning segmentation using cellpose is working, as well as the watershed CV2 implementation.However, there are some incompatibilities with merlin/utils/dataportal.py. How should I solve that issue?

@emanuega emanuega changed the base branch from master to v0.1.7 July 2, 2020 00:42
@emanuega
Copy link
Owner

emanuega commented Jul 2, 2020

What is the incompatible you're finding with dataportal.py? I prefer for all the raw data access to go through the dataportal so that it is easy to switch between AWS, GC, or local storage.

@leonardosepulveda
Copy link
Collaborator Author

leonardosepulveda commented Jul 2, 2020

I am trying to solve the issues that appear in workflow and ci. So far I understand the output from both so I can go back into the code and make modifications. However, now the build stops in the testing part, but it is unclear to me where the problem is:

#!/bin/bash -eo pipefail 
source ~/miniconda/bin/activate root 
conda activate base 
source activate merlin_env 
cd ~/MERlin 
mkdir ~/test-reports 
pytest --cov --cov-report=xml 
 
============================= test session starts ==============================
 
platform linux -- Python 3.6.10, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /home/circleci/MERlin
 plugins: cov-2.10.0
 collecting ... collecting 71 items                                                            collecting 96 items                                                            collected 96 items                                                             
 
test/test_barcode_database.py ..........                                 [ 10%]
 test/test_binary_utils.py ..                                             [ 12%]
test/test_codebook.py ...........                                        [ 23%]
test/test_core.py .....................                                  [ 45%]
test/test_dataorganization.py ..........                                 [ 56%]
 test/test_dataportal.py ............                                     [ 68%]
test/test_dataset.py ...                                                 [ 71%]
 test/test_image_reader.py .                                              [ 72%]
  
Too long with no output (exceeded 10m0s): context deadline exceeded 

Any insights on how to debug? other place where I should look?

@emanuega
Copy link
Owner

emanuega commented Jul 2, 2020

It looks like you are a few commits behind the master and are incorporating some of the changes independently. Can you rebase this branch onto the head of the master branch so we can get all the commits incorporated? (see https://medium.com/@gabriellamedas/git-rebase-and-git-rebase-onto-a6a3f83f9cce). Then we can see if this problem persists.

@leonardosepulveda
Copy link
Collaborator Author

leonardosepulveda commented Jul 6, 2020

Thanks for the tip on rebase. It seems that the things started to work well, but the problem comes again with the changes I added to dataportal.py to make it look like the updated branch. See below what I get:

[lsepulvedaduran@boslogin03 MERlin]$ git rebase master machine_learning_segmentation
First, rewinding head to replay your work on top of it...
Applying: Adding _get_membrane_mask and _get_nuclei_mask functions to WatershedSegmentNucleiCV2 class
...
Applying: add george's modifications to dataportal
Using index info to reconstruct a base tree...
M       requirements.txt
M       test/test_dataportal.py
Falling back to patching base and 3-way merge...
Auto-merging requirements.txt
Applying: adding additional modifications to aws
Using index info to reconstruct a base tree...
M       merlin/util/dataportal.py
Falling back to patching base and 3-way merge...
Auto-merging merlin/util/dataportal.py
CONFLICT (content): Merge conflict in merlin/util/dataportal.py
Failed to merge in the changes.
Patch failed at 0066 adding additional modifications to aws
The copy of the patch that failed is found in:
   /n/home06/lsepulvedaduran/Software/MERlin/.git/rebase-apply/patch

When you have resolved this problem, run "git rebase --continue".
If you prefer to skip this patch, run "git rebase --skip" instead.
To check out the original branch and stop rebasing, run "git rebase --abort".

[lsepulvedaduran@boslogin03 MERlin]$

I looked at the patch 0066 , this is what says:

_[lsepulvedaduran@boslogin03 rebase-apply]$ more 0066
From 60acd7246a1a06562dd50a0147ccfb3d810dd35e Mon Sep 17 00:00:00 2001
From: leonardosepulveda <[email protected]>
Date: Sat, 15 Feb 2020 09:38:38 -0500
Subject: adding additional modifications to aws

---
 merlin/util/dataportal.py | 37 ++++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/merlin/util/dataportal.py b/merlin/util/dataportal.py
index e3d0f6e832b69a6491bee876abd741b44844ddb9..9fcd4631bcb603feeeddee777b1f1d9b3f8ffb0e 100644
--- a/merlin/util/dataportal.py
+++ b/merlin/util/dataportal.py
@@ -2,9 +2,11 @@ import os
 import boto3
 import botocore
 from google.cloud import storage
+from google.cloud import exceptions
 from urllib import parse
 from abc import abstractmethod, ABC
 from typing import List
+from time import sleep


 class DataPortal(ABC):
@@ -340,12 +342,41 @@ class GCloudFilePortal(FilePortal):
         return GCloudFilePortal(
             self._exchange_extension(newExtension), self._client)

+    def _error_tolerant_access(self, request, attempts = 5, wait = 60):
+        while attempts > 0:
+            try:
+                file = self._fileHandle.download_as_string()
+                return file
+            except (exceptions.GatewayTimeout, exceptions.ServiceUnavailable):
+                attempts -= 1
+                sleep(60)
+                if attempts == 0:
+                    raise
+
     def read_as_text(self):
-        return self._fileHandle.download_as_string().decode('utf-8')
+         backoffSeries= [1,2,4,8,16,32,64,128,256]
+        for sleepDuration in backoffSeries:
+            try:
+                file = self._fileHandle.download_as_string().decode('utf-8')
+                return file
+            except (exceptions.GatewayTimeout, exceptions.ServiceUnavailable):
+                if sleepDuration == backoffSeries[-1]:
+                    raise
+                else:
+                    sleep(sleepDuration)

     def read_file_bytes(self, startByte, endByte):
-        return self._fileHandle.download_as_string(
-            start=startByte, end=endByte-1)
+        backoffSeries= [1,2,4,8,16,32,64,128,256]
+            for sleepDuration in backoffSeries:
+            try:
+                file = self._fileHandle.download_as_string(start=startByte,
+                                                           end=endByte-1)
+                return file
+            except (exceptions.GatewayTimeout, exceptions.ServiceUnavailable):
+                if sleepDuration == backoffSeries[-1]:
+                    raise
+                else:
+                    sleep(sleepDuration)

     def close(self) -> None:
         pass
--
1.8.3.1_

Any idea on how to proceed?

@HazenBabcock
Copy link
Contributor

If you open merlin/util/dataportal.py you should see some notes about merge conflicts, fix these and then you should be able to continue with the rebase.

@leonardosepulveda
Copy link
Collaborator Author

I think snakemake is hanging. I encountered this in my latest branch and updated the default snakemake parameters. Does 0f3a617 help?

I added the modifications to snakemakeParameters in [0f3a617], but the problem persists.

@leonardosepulveda
Copy link
Collaborator Author

I found the problem. It was in the test_merfish part. I added some tasks for the new segmentation, and the problem seems to come from there. I created independent tests for debugging.

@leonardosepulveda
Copy link
Collaborator Author

The build run to the end now but the new test files I created cannot be found, this is the complaints

FileNotFoundError: [Errno 2] No such file or directory: '/home/circleci/MERlin/test_analysis_parameters/test_analysis_segmentation_cellpose.json'
 
FileNotFoundError: [Errno 2] No such file or directory: '/home/circleci/MERlin/test_analysis_parameters/test_analysis_segmentation_cv2.json'

the folder test_analysis_parameters do not exist in my current branch, is there some data reorganization that I am missing?

@emanuega
Copy link
Owner

emanuega commented Jul 8, 2020

While running the tests it creates this directory and copies the files from the test/auxiliary files directory. You may have to update the base_files function in test/conftest.py to copy your test_analysis_segmentation json configurations into this folder.

@leonardosepulveda
Copy link
Collaborator Author

Thanks. Now the tests launch, but they are crashing for a reason I do not understand. I created the new tests by copying and modifying test_merfish.py and test_analysis_parameters.json. The current error is

______________________________ test_cv2_2d_local _______________________________

simple_merfish_data = <merlin.core.dataset.MERFISHDataSet object at 0x7f4636cba978>
     
    @pytest.mark.fullrun
    @pytest.mark.slowtest
    def test_cv2_2d_local(simple_merfish_data):
       	with open(os.sep.join([merlin.ANALYSIS_PARAMETERS_HOME,
                               'test_analysis_segmentation_cv2.json']), 'r') as f:
            snakefilePath = m.generate_analysis_tasks_and_snakefile(
                simple_merfish_data, f)
>           m.run_with_snakemake(simple_merfish_data, snakefilePath, 5)
 
test/test_segmentation_cv2.py:15: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _  
merlin/merlin.py:194: in run_with_snakemake
    for t in dataSet.get_analysis_tasks()}
 
merlin/merlin.py:194: in <dictcomp>
    for t in dataSet.get_analysis_tasks()}
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _  
 
self = <merlin.core.dataset.MERFISHDataSet object at 0x7f4636cba978>
analysisTaskName = 'RandomNumberParallelAnalysisTask'
    def load_analysis_task(self, analysisTaskName: str) \
            -> analysistask.AnalysisTask:
        loadName = os.sep.join([self.get_task_subdirectory(
            analysisTaskName), 'task.json'])
        print(loadName)
>       with open(loadName, 'r') as inFile:
E       FileNotFoundError: [Errno 2] No such file or directory: '/home/circleci/MERlin/test_analysis/merfish_test/RandomNumberParallelAnalysisTask/tasks/task.json'

merlin/core/dataset.py:620: FileNotFoundError

For what I understand that random number task is not called in the text_merfish code, so I do not know why it is complaining. I notice that the test that is run before the segmentation tests is test_plotting, which calls the randomNumber tasks. Any insights?

@leonardosepulveda
Copy link
Collaborator Author

leonardosepulveda commented Jul 9, 2020

I changed the name of the tests from test_segmentation... to test_merfish_segmentation... and the workflow works now. The only difference is that the tests run before test_plotting. Is this because the results of each analysis are in the same folder, and when trying the new analysis it reads previous analysis too? This seems to be a bug as I would not expect that the order of the execution should matter. Any ideas on how to solve it? maybe clean the folder after each test?

Now the codecov patch is the only part that is not passing. Most of the code I wrote is in segmentation.py., and has a coverage of 78%. I would need to create a 3D segmentation test to increase the coverage. Could you merge this branch to the main now or you want me to create 3D tests?

@shavan2600
Copy link

Good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants